Generative AI (AIGC) from Scratch - Using the Chroma Vector Database - iT 邦幫忙

2023 iThome Ironman Contest

DAY 13

AI & Data

2023 Journey into Large Language Models - Learning to Build AI Projects from Scratch, Part 13 of the series

Tags: 15th Ironman Contest, vector, vector database, chroma

shrine90459

2023-09-28 20:04:51

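Today we will try out Chroma hands-on. It is a vector database, and according to its official introduction it has the following features:

Simple: up and running in under 5 seconds
Free: open source and free of charge
Integration: works with many packages, such as LangChain and LlamaIndex

So just how simple is it? Let's explore it together.

1. Installation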

pip install chromadb

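Because of the PyTorch dependency, Python 3.11 is not supported at the time of writing. If you are on 3.11, you need to set up a separate Python 3.10 environment.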
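
To catch the version problem before installing, a small check like the one below can help. It is a minimal sketch that assumes the limit described above (Python 3.11 and newer unsupported when this post was written); the helper name chroma_supported is hypothetical, and newer chromadb releases may no longer have this restriction.

```python
import sys


def chroma_supported(version=None):
    """Return True if the interpreter is older than 3.11 (the limit noted above)."""
    if version is None:
        version = sys.version_info
    major, minor = version[0], version[1]
    return (major, minor) < (3, 11)


# Report on the interpreter that is running this script.
if chroma_supported():
    print("OK: this interpreter is below 3.11")
else:
    print("Python 3.11+ detected: consider a separate 3.10 environment")
```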

2. Initialize the Chroma client

import chromadb
chroma_client = chromadb.Client()

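3. Create a collection

A collection is where embeddings, documents, and metadata are stored. You can create many collections; each one is independent and is identified by its name.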

collection = chroma_client.create_collection(name="my_collection")

4. Add some text documents

Store the data in the collection; Chroma automatically handles tokenization, embedding, and indexing.

collection.add(
    documents=["This is a document", "This is another document"],
    metadatas=[{"source": "my_source"}, {"source": "my_source"}],
    ids=["id1", "id2"]
)

If you have your own embedding model, you can also compute the vectors yourself and store them directly, as follows:

collection.add(
    embeddings=[[1.2, 2.3, 4.5], [6.7, 8.2, 9.2]],
    documents=["This is a document", "This is another document"],
    metadatas=[{"source": "my_source"}, {"source": "my_source"}],
    ids=["id1", "id2"]
)
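
The lists passed to add() are parallel: the i-th embedding, document, metadata, and id all describe the same record, and every embedding needs the same number of dimensions. A quick pre-flight check could be sketched as follows. Note that validate_records is a hypothetical helper written for illustration, not part of chromadb.

```python
def validate_records(ids, documents, metadatas, embeddings=None):
    """Check that the parallel lists line up before calling collection.add()."""
    n = len(ids)
    if len(set(ids)) != n:
        raise ValueError("ids must be unique")
    if len(documents) != n or len(metadatas) != n:
        raise ValueError("documents and metadatas must match ids in length")
    if embeddings is not None:
        if len(embeddings) != n:
            raise ValueError("embeddings must match ids in length")
        # Every vector must have the same dimensionality.
        dims = {len(vec) for vec in embeddings}
        if len(dims) > 1:
            raise ValueError("all embeddings must have the same dimension")
    return n


count = validate_records(
    ids=["id1", "id2"],
    documents=["This is a document", "This is another document"],
    metadatas=[{"source": "my_source"}, {"source": "my_source"}],
    embeddings=[[1.2, 2.3, 4.5], [6.7, 8.2, 9.2]],
)
print(count)  # 2
```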

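5. Run a similarity query on the collection

query_texts is the text you want to search for; as before, Chroma automatically converts it into a vector and compares it against your stored data.
n_results is the number of results to return.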

results = collection.query(
    query_texts=["This is a query document"],
    n_results=2
)

Query result:
{'ids': [['id1', 'id2']], 'embeddings': None, 'documents': [['This is a document1', 'This is another document1']], 'metadatas': [[{'source': 'my_source'}, {'source': 'my_source'}]], 'distances': [[0.8226760625839233, 1.0733070373535156]]}
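
To make the distances field more concrete: as far as the Chroma documentation describes it, results are ranked by squared L2 (Euclidean) distance by default, with smaller values meaning more similar, and each field is a nested list holding one inner list per query. The sketch below is a brute-force version of that ranking in plain Python, reusing the toy 3-dimensional vectors from step 4. The function names and the query vector are made up for illustration, and real Chroma uses an approximate index rather than a full scan.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def toy_query(stored, query_embeddings, n_results=2):
    """stored maps id -> vector. Returns nested lists, one inner list per query."""
    ids, distances = [], []
    for q in query_embeddings:
        # Rank every stored vector by its distance to the query (smallest first).
        ranked = sorted((squared_l2(q, vec), id_) for id_, vec in stored.items())
        top = ranked[:n_results]
        ids.append([id_ for _, id_ in top])
        distances.append([d for d, _ in top])
    return {"ids": ids, "distances": distances}


stored = {"id1": [1.2, 2.3, 4.5], "id2": [6.7, 8.2, 9.2]}
result = toy_query(stored, query_embeddings=[[1.0, 2.0, 4.0]], n_results=2)
print(result["ids"])  # [['id1', 'id2']]
```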

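At this point we have completed a full vector similarity query! Pretty simple, isn't it?

Note that create_collection raises an error if a collection with that name already exists, so when you run the code again against the same data you cannot call collection = chroma_client.create_collection(name="my_collection") a second time.

Because that call only creates collections, fetch the existing one with get_collection() instead: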

collection = chroma_client.get_collection("my_collection")
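
If you would rather not switch between the two calls, chromadb also provides get_or_create_collection(), which returns the existing collection or creates it when it is missing. A minimal sketch, where open_collection is a hypothetical wrapper and client is any object exposing that method (such as the chroma_client from step 2):

```python
def open_collection(client, name="my_collection"):
    """Fetch the collection if it already exists, otherwise create it."""
    return client.get_or_create_collection(name=name)


# Usage, assuming chroma_client was created as in step 2:
#   collection = open_collection(chroma_client)
```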

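However, at this point the data only lives in memory and is not saved; once you close the program, it is all gone.
This is where the following approach comes in.

6. Persist the collection data

persist_directory is where you want the data saved. This simply modifies step 2 by passing settings to the client: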

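import chromadb
from chromadb.config import Settings

# Legacy (pre-0.4) configuration; chromadb 0.4 and later use
# chromadb.PersistentClient(path=...) instead of these Settings.
client = chromadb.Client(Settings(
    chroma_db_impl="duckdb+parquet",
    persist_directory="/path/to/persist/directory"  # Optional, defaults to .chromadb/ in the current directory
))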

Below are the points that the official documentation specifically calls out: